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ABSTRACT 

The increasing complexity of the software/hardware stack of 
modern supercomputers has resulted in an explosion in the 
parameter space for performance tuning. In many ways per¬ 
formance analysis has become an experimental science, made 
even more challenging due to the presence of massive irregu¬ 
larity and data dependency in important emerging problem 
areas. As with any experimental science, a characterization 
of experimental conditions is important for identifying which 
variables in the experiment affect the outcome. To gain 
insight into the experimental nature of performance analysis, 
we analyze how the existing body of research handles the 
particular case of distributed graph algorithms (DGAs). We 
distinguish algorithm-level contributions, often prioritized 
by authors, from runtime-level concerns that are harder to 
place. With a careful exposition of the tuning process for a 
high-performance graph algorithm, we show that the runtime 
is such an integral part of DGAs that experimental results are 
difficult to interpret and extrapolate without understanding 
the properties of the runtime used. We argue that in order 
to gain understanding about the impact of runtimes, more 
information needs to be gathered. To begin this process, 
we provide an initial set of recommendations for describing 
DGA results based on our analysis of the current state of 
the field. 

Categories and Subject Descriptors 

D.1.3 [Programming Techniques]: Goncurrent Program¬ 
ming —Distributed programming 
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1. INTRODUCTION 

Large irregular applications are gaining recognition as the 
future challenge in parallel computing. This is reflected by 
the GraphbOO benchmark |22j . the subject of which is the 
prototypical irregular problem of graph traversal. Graph 
traversal is a basic building blocks of other graph algorithms 
used in social network analytics, transportation optimization, 
artificial intelligence, power grids, and, in general, any prob¬ 
lem where data consists of entities that connect and interact 
in irregular ways. The current GraphSOO benchmark is based 
on breadth-first search (BPS) with a proposal to extend the 
benchmark with single-source shortest paths (SSSP). In this 
paper, we concentrate on BPS and SSSP for the same reasons, 
i.e., as representatives of a class of irregular graph problems. 

Research on distributed graph algorithms is an emerging 
and active field. New algorithms, new approaches to dis¬ 
tribute the data and new performance results appear at most 
major distributed computing conferences. The GraphSOO 
benchmark bears witness to the progress, with the best re¬ 
sults progressing from 7 GTEPS (billions of traversed edges 
per second) in 2010, 253 GTEPS in 2011, 15 TTEPS (tril¬ 
lions of TEPS) in 2012, to 23 TTEPS in 2014. Many new 
algorithmic techniques have been developed, e.g., direction 
optimization mil] , pruning m, k-level asynchronous algo¬ 
rithm m, hybrid algorithms m, and distributed control 
[33] . A practitioner faces a multitude of published approaches, 
which are often vague on low-level details of implementations. 

Graph problems are difficult to fit to the conventional High 
Performance Gomputing (HPG) platforms. Over the decades 
of development, these platforms have been optimized for 
problems exhibiting good locality and regular memory access 
and communication patterns, benefiting from caching and 
high-bandwidth regular collective operations. In contrast, 
graph algorithms exhibit little locality, rarely require any 
significant computation per memory access, and result in 
high-rate communication of small messages. Unlike regular 
applications that are built on top of well understood regular 
communication and memory access, graph algorithms inter¬ 
act with the whole software and hardware stack in a complex 
way due to their data-driven, fine-grained, irregular nature 
of tasks. Each piece of the stack, designed independently, 
from the application level, through the transport layer, to the 
hardware layer and the topology of the physical network, in¬ 
teracts within the system in unpredictable ways. This makes 
designing distributed graph algorithms a truly experimental 
science, and this state of affairs will be only exacerbated as 
we move towards exascale computing. 

We argue that the advancements in the field are hard to 


generalize and reconcile because the information reported is 
commonly incomplete. The low-level details of implemen¬ 
tations are often vague or missing. Yet, these can have 
important impact. In this paper we present evidence of im¬ 
pact of low-level transport, scheduler, and hardware, which 
we refer to as runtime. It should be noted that the com¬ 
plexity of the interactions between high-level algorithm and 
low-level runtime that we expose is not unknown to the scien¬ 
tists in the field. This knowledge is implicit, fragmented, and 
often sidelined in presentation of new techniques. Notably, 
Checconi and Petrini na, who achieve the top results in the 
Graph 500 benchmark in part due to direct access to the 
SPI (System Programming Interface) low-level primitives, 
provide an outstanding analysis of their evolving implemen¬ 
tation, including a 3-years timeline of changing conclusions 
and understanding. However, this is not typical for the field. 
We point out a need for community standards and guidelines 
for presenting experimental results. 


Contributions: 

A motivating case study ( |Sec. 2[ ); 

A survey and analysis of the field (Sec. 3), identifying, 
classifying, and discussing two levels of distribut ed graph 
algorithms: (i) Application-level aspects (Sec. 3.1) that au¬ 
thors identify as the main algorithmic contributions of their 
research; and (ii) Runtime-level aspects (Sec. 3.2) that au¬ 
thors do not explicitly consider a part of the algorithm but 
that play a crucial role in the overall performance; 

An initial set of recommendations (Sec. 4) for authors 
to consider when describing their research. 


2. MOTIVATING CASE STUDY 

Pingali et al. m classify algorithms into two main cate¬ 
gories: ordered and unordered. Ordered algorithms require or¬ 
dering of tasks for correctness whereas unordered algorithms 
do not. Parallelizing ordered algorithms is challenging as 
parallel execution must maintain the ordering. Unnordered 
algorithms are easier to parallelize as tasks can be executed in 
any order. Although the order of execution does not impact 
correctness in unordered algorithms, a task priority can be 
used to partially order tasks, improving performance. Our 
distributed control (DC) approach [33] is a work scheduling 
method for distributed unordered algorithms that benefit 
from task priority. The goal of DC is to remove the overhead 
of synchronization and global data structures by using only 
local knowledge to select best work, thus obtaining an ap¬ 
proximation of the global ordering. Because DC does not use 
global data structures and synchronization, it is particularly 
sensitive to the runtime characteristics such as timing of task 
execution, communication latency, and buffering. In fact, 
their effect can be so significant as to make DG an infeasible 
approach if not handled carefully. In this section, we discuss 
the significant runtime effects we have observed. 

[Figure l] illustrates DC for SSSP. A distributed system con¬ 
sists of workers divided into several shared memory domains. 
Each domain contains a part of the global data. Processing 
a task on one domain may generate more tasks that depend 
on data on other domains. Remote tasks are communicated 
through an unordered global task bag where every worker 
puts the tasks it generates. Workers continuously try to 
retrieve tasks from the bag into their private working sets 
that are ordered according to task priority. Because the task 
bag is unordered, the more tasks in the bag and not in the 
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Figure 1: An overview of Distributed Control, using SSSP as 
an example. 


ordered private priority queues of workers, the further the 
approximated ordering is from the ideal least-work ordering 
(Dijkstra’s priority queue). 

Ideally, the underlying runtime system delivers tasks to 
the appropriate ordered private workset as soon as possible. 
On the other hand, quick delivery comes at a cost: the accu¬ 
mulative costs of network sends overhead, the necessity for 
frequent polling, frequent context switches when handling 
small tasks, and so on, add up to a significant overhead. 
To balance these competing needs, we use the AM++ [29] 
runtime. AM-h-1- is particularly suitable for this purpose be¬ 
cause it supports fine-grained parallelism of active messages 
with communication optimization techniques such as scal¬ 
able addressing, active routing, message coalescing, message 
reduction, and termination detection. 

In the remainder of this section, we discuss the character¬ 
istics of the runtime we experimented with and the impacts 
we observed. Our experiments were run at different times on 
different machines (Cray environments). The experiments 
we discuss here were ran on Big Red 2 at Indiana Univer¬ 
sity m and on Edison at NERSC [4j . The most important 
differences between the two machines are the topologies, 3-D 
torus on Big Red 2 vs. Dragonfly on Edison, and the MPI im¬ 
plementations, Cray’s Message Passing Toolkit (MPT) 6.2.2 
on Big Red 2 vs. MPT 7.1.1 on Edison. All experiments 
were run on Graph500 graphs. 

2.1 Runtime Parameters 

DC consists of layers, each with its own set of possible 
design choices and parameters. The exact set of parameters 
suitable for an application depends on the specifics of the 
machine (network overheads, etc.) and on the input (graph 
structure, edge weights, etc.). Some examples of parameters 
in our implementation include: 

• Progress related parameters such as the threshold for 
eager progress and the frequency of progress calls. 

• Optimizations such as reduction cache size, priority 
messaging, and “self-send” check. 

• Coalescing buffers size, controlling the maximum num¬ 
ber of messages that can be sent together. 

• MPI-related parameters such as receive depth, the num- 































20 

96 104 112 120 128 

# nodes 

Figure 2: Effect of coalescing size on DC SSSP algorithm on 
a scale 31 graph (Big Red 2). 
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Figure 3: Effect of coalescing size on DC SSSP algorithm on 
a scale 31 graph (Edison). 


her of polling tasks, and the number of outstanding 
sends allowed. 

We discuss some of these parameters in more detail in the 
remainder of this section. In addition to parameters directly 
related to AM++ and our algorithm, there are environment 
parameters, including: 

• MPI progress model, with several choices on the ma¬ 
chines that we have run on. 

• Job placement, which can severely impact performance. 

• Use of huge pages modules on Cray machines. 

The parameter space is further enlarged by the characteristics 
of inputs. For example, we observed that edge weights have 
a significant impact on the choice of optimal parameters for 
DC. 


2.2 Coalescing Size 

To increase bandwidth utilization, AM++ performs mes¬ 
sage eoalescing, combining multiple messages sent to the 
same destination into a single, larger message. Messages are 
appended to per-destination buffers, and to handle partially 
filled buffers, a periodic check is performed to check for ac¬ 
tivity. In the case oi DC SSSP, a single message consists of a 
tuple of a destination vertex and distance, 12 bytes in total. 
With such small messages, coalescing has great impact on 
the performance, but finding the optimal size is difficult. 

We investigated the impact of coalescing in graphSOO scale 
31 gra phs whe n runnin g DC SSS P with max edge weight 
of 100(Figs. 2 and|^. [Figure ^ shows the large impact 
of a small change in the coalescing size, measured by the 
number of SSSP messages per coalescing buffer. Changing 
the coalescing size by less than 2% causes over 50% increase 
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Figure 4: Effect of coalescing size on DC BFS algorithm on 
a scale 31 graph (Big Red 2). 
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Figure 5: Effect of asynchronous progress on Big Red 2. 


in the run time. This unexpected effect is caused by the 
specifics of Cray MPI protocols. At the smaller coalescing 
size, full message buffers fit into rendezvous RO protocol 
that sends messages of up to 512K using one RDM A GET, 
while the larger buffers hit R1 protocol that sends chunks of 
512K using RDMA PUT operations. At the size of 44000, 
the bulk of the message fits into the first 512K buffer, and 
the small remainder requires another RDMA PUT, causing 
overheads. The sizes 43000 and 86000 fill out 1 and 2 buffers, 
respectively, achieving similar performance. The larger size, 
86000, results in better scaling properties. 

We ran a more extensive suite of benchmarks on Edison. 
[Figure 3| shows the coalescing buffer size experiments on 
Edison. The results are similar, with a periodic increases 
in the minimum run time as protocol buffers mismatch the 
coalescing buffers. The maximum run times signify the worst 
run time, as other parameters than coalescing are adjusted. 
The results show that adjusting other parameters is less and 
less important as the coalescing buffer size increases. 

[Figure 4] shows the effects of coalescing on a DC BFS, 
which is SSSP with maximum weight of 1. Surprisingly, in¬ 
creasing the coalescing size impacts performance negatively. 
We suspect that with smaller weights the possibility of re¬ 
ward from optimistic parallelism \n DC decreases, and the 
added latency of coalescing has a much larger effect than 
with larger weights. Also note that we have not actually 
discovered the optimal coalescing size, which would require 
more experiments and more resources. All three cases shown 
in [Figs. 2[ to [^ show that adjusting the coalescing size is 
important, and the optimal value is not static. Rather, it de¬ 
pends on algorithmic concerns such as reward from optimistic 
parallelism. 

2.3 Transport Progress 

At first, when we experimented with DC on Big Red 2, we 
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Figure 6: Effect of AM++ progress parameters. 


found out that it was performing worse than A-stepping. This 
raised a concern that the DC approach may not be practical. 
We suspected the possibility of message latencies being a 
culprit; so, upon researching MPT, we decided to experiment 
with asynchronous progress, which uses separate threads to 
perform progress in certain situations. Despite Cray’s warn¬ 
ing at the time that thread-multiple progress required for 
asynchronous progress “is not considered a high-performance 
implementation”, we observed significant gains for DC, shown 
in |Fig. 5] We ran the experiment on GraphSOO scale 31 opti¬ 
mal strong scaling results. Without asynchronous progress, 
performance decreased with the increased number of nodes 
(with an unexplained anomaly at 112 nodes). (Note that 
all our experiments are averaged; thus, large anomalies are 
indicative of unexpected circumstances.) With asynchronous 
progress thread, the performance of DC has improved more 
than tenfold with growing node counts, entirely changing 
the viability of the approach. This dramatic effect illustrates 
how deeply an algorithm interacts with the runtime, and how 
a gap in parameter space may lead to incorrect conclusions 
about DGA approaches. Interestingly, we did not observe a 
similar effect on Edison, where two different asynchronous 
progress and the standard progress modes perform similarly. 

2.4 Distributed Control Progress 

In addition to transport layer progress, AM++ performs 
its own internal progress when AM++ interfaces are called. 
Since DC is built around a loop that empties the local priority 
data structure, it must occasionally, with some frequency call 
into the appropriate AM++ interfaces that perform progress. 
This frequency is controlled by 2 parameters: the end-epoch 
test frequency (EE) and the eager progress limit (EL). EE 
controls how many iterations of the DC loop run before 
AM++ progress is invoked. The eager limit is a threshold 
of outstanding DC tasks below which AM++ progress is 
performed every iteration of the DC loop. 

[Figure 6] shows the effects of progress parameters, using 
performance data averaged over multiple runs while varying 
orthogonal parameters. Edison shows a significant sensitivity 
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Figure 7: The impact of partial and full buffer counts on 
performance with coalescing size fixed at 100000 (Edison). 

to the EE parameter. Smaller values are better, with 22 
being the best of the ones tested. This suggests that latency 
may be a limiting factor on Edison. On Big Red 2, the 
results of varying the EE parameter are less pronounced, 
but the average of multiple experiments that we show here 
still suggests some sensitivity with the optimal value similar 
to that on Edison. Altogether, the results show that the 
performance of DC depends on the progress model. 

2.5 Buffering and Work Efficiency 

The prerogative of coalescing in AM++ is to decrease the 
overhead by sending as many full coalescing buffers as pos¬ 
sible. Partially filled buffers are only sent when no more 
messages are being inserted. [Eigure 7] shows DC results on 
Edison for coalescing buffer size of 100000. We found that 
the best predictor of performance is the amount of partial 
buffers (fewer is better) followed by full buffers (more is 
better). Partial buffers indicate periods of lack of work, and 
this, in turn, indicates that the local priority queues are 
getting depleted more often, decreasing overall performance. 
AM++ was originally optimized for algorithms like BFS and 
A-stepping, which benefit from eager optimization of commu¬ 
nication overhead and are not sensitive to work imbalance. 
Our example shows that optimization of runtime for a seem¬ 
ingly worthy goal can negatively impact applications that 
have other needs not anticipated by runtime developers. 

2.6 Performance Irregularity 

A large supercomputer runs jobs on a complex topology, 
with an allocation system designed to maximize utilization 
and ensure fairness. To satisfy these goals and to account 
for complex topologies, jobs may be mapped to hardware in 
different ways between different runs, and hardware resources 
may be shared by jobs in different degrees depending on the 
current job configuration. [Figure 8\ shows scaling results 
for DC together with our implementation of A-stepping for 
comparison. DC experiences a drop in performance after 
96 nodes, as does A-stepping, although DC recovers faster 
than A-stepping. A similar drop can be observed for scale 
30, but it occurs later, at 112 nodes. DC for scale 29 also 
shows a drop in performance after 104 nodes, but it is not 
as dramatic as at larger scales. Gurrently, we do not have 
an explanation for performance variability, but we suspect 
the effects of job placement as discussed in [Sec. 3.2.3| where 
at some node counts a job on a certain input does not map 
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Figure 8: Irregularities in strong scaling on Big Red 2. 


well to the underlying topology. 

Another kind of irregularities are not as consistent, and 
they depend on transient conditions such as availability of 
contiguous blocks of nodes and interaction with other jobs. 
For example, on Edison, we noticed that some jobs we run 
more than once took up to 25% more time in some runs. 

2.7 Caching 

We implemented a write-through cache with the most re¬ 
cent SSSP messages. When a message to a given destination 
is discovered in cache, it is either discarded if the previous 
message had smaller distance, or it replaces the old message 
in cache and is sent through. Caching can improve perfor¬ 
mance dramatically, as we observed in our implementation 
of A-stepping. However, for DC, with caching performance 
deteriorates. For example, on runs on scale 30 graph on Big 
Red 2, DC performed by almost 30% worse with caching 
despite reducing work by about 10%. The interesting con¬ 
clusion is that an optimization technique can be actually 
detrimental to the performance of an algorithm, which can 
be counterintuitive. 
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techniques, the cost of synchronization is hard to control and 
eliminate. In this regard, an underlying runtime can have 
a signihcant impact. The more an algorithm depends on 
keeping global information about the runtime (e.g., for load 
balancing), the higher the costs of synchronization necessary 
to maintain that information. In [Figure 9| we count a task 
as rejected when the vertex distance in delivers is higher 
than what is already recorded and, consequently, the task 
is not inserted into the priority queue of DC or a bucket of 
A-stepping. Invalidated tasks are similar to rejected tasks, 
but their distance expires while they wait in priority queue. 


3. THE ANATOMY OF DGAs 

In previous section we show the interactions between run¬ 
time and the application cannot be neglected. Next we 
analyze the existing research on distributed BFS and SSSP 
problems. We provi de the a natomy of DGAs, c onsisting 
of application-level (Sec. 3.1) and runtime-level ( |Sec. 3.2| ) 
aspects. As explained in [Sec. 1| this division is based on 
how the research results are presented in the held. Authors 
usually concentrate on the application-level aspects, framing 
their contributions at that level, and treat the interactions 
with runtime as secondary results, often providing incom¬ 
plete information about it. The purpose of our analysis is to 
describe the complexity of interactions between application 
and runtime and to provide a blueprint for a more complete 
treatment of DGAs. 


2.8 Work vs. Overhead 

Performance of an algorithm depends on the amount of 
work it performs and on the amount of overhead that this 
work incurs in a given runtime. [Figure 9| shows the work 
statistics comprising of useful work, useless work, rejected 
work and invalidated work for DG and our implementation 
of A-stepping with scale 31 graph. While DG performs 
better than A-stepping, DG always executs more work than 
A-stepping in the most efficient conhgurations of both of 
the algorithms. As can be seen from [Figs. 8[ and[9l despite 
consistently performing 10%-25% more work, DG performs 
better in all instances of tests at scale. This shows that 
synchronization and uneven distribution of work have an 
important effect on the performance of DGAs. While one 
can attempt to mitigate the work imbalance with algorithmic 


3.1 Application-Level aspects of DGAs 

[Table l| summarizes the set of parameters we identified 
as the application-level aspects of DGAs, and divide them 
into 4 categories. The approach category is about the main 
algorithmic choices, the algorithmic considerations category 
covers the main aspects of the approach, and the categories 
of graph representation and data struetures cover the data 
structures that are used. 


3.1.1 Approach 

The approaches fall i nto thr ee main categories: unordered, 
ordered, and mixed (cf.. Sec. 2). Unordered algorithms expose 
parallelism and decrease the need for synchronization. DC 
orders work locally, in private priority queues, and HSync 
does not order work at all (chaotic traversal) in its unordered 
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Level-Synchronous^^ 
Bellman- FordESl 
Combinatorial BLASElISlIIl 
HybridES] 

HSyncISI 

Distributed controlEH 

klaEH 

A-stepping El EH 


Algorithmic Considerations 


Sec. 3.1.2 


Data Distribution 


Optimizations 


Load Balancing 


2DlIlinil32] 

Edge list^^I^ 

GhostsEninniiiiiiiii 

Direction optimization^^I^^i^ti^ 

PruningEHl 

Priority messages^^ 

Tree-based broadcast, reduction, 
and filtering^^ 

Per-thread work splitting^J 
Random shuffling of 
vertex identifiers^^ 
DelegatesI^ 

Proxies^ 


Graph Representation 


Sec. 3.1.3 


compressed coarse-index adjacency list^J, 
skip listE^, doubly-compressed sparse column (DCSC)^^I^ 


Data Structures (algorithm progress) 


Sec. 3.1.4 


Distributed async visitor queue^^^^, thread-local 
priority queue^^^^, dynamic-array buckets^^ 


Table 1: Application-Level aspects of DGAs. 


phase. Because of the lack of global ordering, both of these 
approaches are strongly affected by the order of task exe¬ 
cution and message order (e.g., see Sec. 3.2.2). A-stepping, 
KLA and Hybrid (which executes A-stepping in unordered 
phase) utilize rounds of unordered computation separated 
by global barriers, where the computation is divided into 
buckets based on a meta property (e.g., distance in SSSP) 
in A-stepping or on graph topology in KLA. Thanks to the 
use of global rounds, these approaches are less susceptible 
to task and message ordering, but they can exhibit straggler 
effect |33| where parts of the system are idle while other 
parts are still finishing the rounds. On the other hand, the 
ordered approaches rely on global barriers to order execu¬ 
tion, and because of that they are susceptible to straggler 
effect. Algorithm load balancing (Sec. 3.1.2) and runtime 
load balancing ( |Sec. 3.2 ) are important characteristics in such 
algorithms to distribute load evenly among processes and 
reduce straggler effect. Ordered algorithms naturally fit the 
BSP model [28], but careful implementations can efficiently 
overlap computation and communication (e.g., na). 


3.1.2 Algorithmic Considerations 
Data distribution. There are 3 main ways to distribute 
graph data among the processing elements. They are ID dis¬ 
tribution^ 2D distribution and Edge List Partitioning (ELP). 

A graph distribution makes use of certain runtime charac¬ 
teristics. For example, 2D distribution based algorithms com¬ 


monly use collectives from transport (discussed in jSec. 3.2.1| . 
2D distributions with collectives may suffer from memory 
scalability issues in small memory machines. To avoid such 
memory scalability issues, researchers have developed collec¬ 
tives using point-to-point communication [32]. 

A graph distribution is also connected to the logical topol¬ 
ogy (discussed in |Sec. 3.23 ) of the network, which defines the 
shape of the 2D distribution (square, rectangle etc.). Bulug 
et al. [9] showed that the logical topology of 2D distribution 
has an impact on the performance of the algorithm as it 
effects both communication and computation. They call it 
Processor Grid Skewness. In particular, “skewness” changes 
the number of processors in collective operation and also the 
size of the local data structure. 

Yoo et al. [^ showed that the number of processors par¬ 
ticipating in collective communication for ID distribution is 
0{p) whereas, in 2D distribution, it is reduced to 0{ytp) (p is 
the number of processors in the process group) and claimed 
2D is better for distributed BPS. But recently Checconi and 
Petrini m compared 2D distribution implemented in m 
with their ID distribution and found that the performance of 
the BPS implementation with 2D distribution is slower com¬ 
pared to ID because the 2D implementation did not employ 
runtime characteristics such as compression, communication 
and computation overlapping methods. Further they also 
showed scaling was limited in 2D due to cache thrashing. 

To overcome certain overhead incurred due to high degree 
vertices (in ID), Pearce et al. [25] introduced ELP. For ELP, 
we sort the edgelist of a graph according to their sources 
and partition them evenly. But with ELP, a vertex state 
may reside in multiple nodes (see Delegates under Load 
Balancing) and requires synchronization of states between 
nodes via reductions. 

Optimizations. Research community in this area uses 
various techniques to improve the efficiency of a proposed 
algorithm. 

Beamer et al. [5] introduced direction optimization to 
standard level synchronous BFS algorithm. Direction op¬ 
timization reduces the number of vertices to be visited by 
traversing vertices in a bottom up fashion. Beamer et al. [5] 
showed that direction optimization reduces contention for 
some atomic operations as bottom up approach removes the 
need for atomic operations (because only child writes to itself, 
hence removing the contention). Though direction optimiza¬ 
tion is a promising approach to improve performance, the 
initial method suffered from several runtime difficulties [6]. 
Each processor needs to check the membership of the frontier 
set but frontier set is too large to replicate across nodes. 
In addition, each vertex requires searching for its parent 
sequentially. To solve the hrst problem authors of [6] used 
2D distribution and to tackle the second problem they used 
systolic shifts. 

High degree vertices are challenging for distributed graph 
algorithms. Ghost vertices is an optimization technique to 
alleviate overhead incurred by high degree vertices . Ghost 
vertices reduce some remote communication by locally storing 
the distributed state of high degree vertices. For larger 
graphs, ghost vertices can limit the memory scalability. 

In SSSP, when an algorithm detects a vertex with rel¬ 
atively lower distance the processing node can relax that 
vertex with high priority. Messages generated using such 
vertices are called priority messages. During the execution 
of an algorithm, if a node observes that a message with high 






























priority (e.g., better vertex distance in SSSP) is received, 
it is processed immediately bypassing other intermediate 
data structures. Runtime support is necessary for this opti¬ 
mization to be successful. The priority messages should get 
delivered to respective owners swiftly. Therefore the sender 
needs to maintain a separate channel for priority messages. 
Further if runtime is supporting message aggregation (coa¬ 
lescing) , the priority messages should maintain low coalescing 
buffer size compared to normal buffers. 

Load Balancing. Load balancing in general attempts to 
distribute work among participating processes uniformly. In 
the following we discuss few strategies for load balancing. 

Per thread work splitting & Proxies: Chakaravarthy et al. 
m used two tier mechanism: intra-node thread level and 
inter-node vertex-splitting strategies, based on proxies, to 
achieve load balancing. In intra-node thread level load bal¬ 
ancing, besides the owner thread of a high degree vertex, 
other threads can participate in the relax operation for the 
edges involved. In-node NUMA characteristics affect intra¬ 
node load balancing. For inter-node processing, the authors 
employed proxies. Proxies reside in different localities than 
the high degree vertex. Proxies are connected to the high de¬ 
gree vertex with 0 weight. Time for processing a high degree 
vertex connected with proxies depends on the job placement, 
i.e. if a proxy is connected through a slow channel, the time 
to process the vertex relevant to the proxy is increased. 

Random shuffling of vertex identifier: Bulug and Mad- 
duri [8] achieved a reasonable load balancing by randomly 
shuffling all the vertex identifiers prior to partition. In the 
runtime, random shuffling changes the algorithm communi¬ 
cation pattern. This helps to share runtime resources near 
equally among all processes. 

Delegates: Pearce et al. m used delegates to distribute 
edge lists of high-degree vertices across multiple processes. 
While partitioning, the outgoing edges are placed at the 
edge’s target vertex location. Then one of high degree vertex 
owners is designated as a controller and others are assigned 
as delegates. The controller maintains the state of the vertex 
and delegates keeps a copy of the updated state. Delegates 
communicate with each other using asynchronous broadcast 
and reduction operations. According to the authors of [26], 
distributed delegate technique is more efficient compared to 
ELP in terms of communication reduction. 

3.1.3 Graph Representation 

In the literature we find two main techniques to represent 
a graph. They are Adjacency list and Compressed sparse row 
(CSR). The CSR representation minimizes the space needed 
using a row indexing mechanism. Compared to adjacency 
list representation, CSR is a more compact and efficient data 
structure but less versatile. 

Edmonds et al. m showed that, due to compact size (less 
memory compared to adjacency list) and efficient access meth¬ 
ods (direct indexing), the CSR implementation outperforms 
adjacency list representation. Very recently, Bulug et al. [9] 
showed that, at large scales (33 and above), the CSR repre¬ 
sentation does not scale with memory. Therefore to represent 
larger graphs, [9] used Doubly Compressed Sparse Column 
(DCSC). As per the authors, CSR representation is faster 
than DCSC. They also demonstrated that CSR performance 
is affected by in-node multithreading and NUMA behaviors. 
The authors received 15-17% performance improvement when 
executed with multi-threading within NUMA domains with 


the CSR implementation. 

Storing adjacencies in linked lists incurs cache misses on 
traversal (pointer chasing of list nodes). To avoid sequential 
navigation Checconi and Petrini HU used indexed skip lists 
for adjacency lists. Skip lists contain shortcuts to navigate 
from one source vertex to the following. When a few vertices 
need to be visited, a coarse index allows to skip large portions 
of adjacency list. However Skip Lists need more memory 
to store additional indexing data (compared to standard 
adjacency list representation). 

3.1.4 Data Structures for Algorithm Progress 

Algorithms use data structures to maintain intermediate 
state. Pearce et al. [25] used a visitor queue that is im¬ 
plemented as a collection of priority queues, to maintain 
adjacency vertices of an already visited vertex. A prior¬ 
ity queue for a vertex is selected based on a hash function. 
Using multiple threads with a hash function reduced lock 
contention. We used thread local priority queues for DC 
SSSP [33]. We took this approach to avoid contention on 
priority queues. To represent frontier vector for BFS, Bulug 
and Madduri [8] used a thread-local stack. These are merged 
into frontier set at the end of each iteration. Though this 
approach reduces contention, it needs additional memory to 
occupy thread-local stacks and copying also takes time. In 
our A-stepping [33] implementation we used a shared mem¬ 
ory bucket data structure that is based on arrays. Atomics 
were used to avoid any race conditions when making changes 
to buckets. 


3.2 Runtime-Level aspects of DGAs 

In|S ec. 2| we show how different runtime-level parameters 
effect the performance of DGAs. In this section, we categorize 
important runtime-level parameters based on the review of 
existing literature for DGAs. [Table 2| summarizes the set of 
low-level runtime parameters. 


3.2.1 Communication Paradigm 

The choice of communication paradigm can have a notable 
impact on performance of DGAs. Each paradigm imposes dif¬ 
ferent tradeoffs in terms of memory constraints, synchroniza¬ 
tion overhead and network latency. The collectives paradigm 
is used when large low-overhead stages of all-to-all communi¬ 
cation are needed, point-to-point paradigm allows for finer 
overlap between computation and communication at the 
expense of code complexity, and active messages are a re¬ 
finement of point-point communication that adds an implicit 
execution of handlers on remote objects. Einally, one-sided 
paradigm provides remote memory operations (GET, PUT, 
etc.) which are very efficient, but require remote memory 
management protocol. For example, collectives are the base 
of BLAS approaches and level-synchronous approaches. How¬ 
ever, Checconi and Petrini m show how using lightweight 
point-to-point communication may lead to improvements in 
traditionally synchronous approaches. They compare their 
point-to-point implementation using low level SPI interface 
(discussed in Sec. 3.2.2) to an MPI implementation using col¬ 
lectives. They note the large memory footprint required for 
collective buffers, which forces them to decrease the scale of 
the problem per node. Furthermore, collectives do not allow 
for easy interleaving of computation with communication. 
Given these two factors and the overhead of MPI over SPI, 
they note a fivefold decrease in performance. Active messages 







Communication Paradigm 


Sec. 3.2.1| 


Point-to- PointE^^^, Collectives!^ 
One-Sided, Active messages^ 


Transport 

Sec. 3.2.2| 

Request Tracking 

Progression^! : 

• Asynchronous 

• Synchronous 

Remotely synchronous 

Locally synchronous^! 

Asynchronous!^ 

System threads!^ 

User threads 

Explicit progress!^ 

Runtime scheduler!^ 

Lightweight task!^ 

Bit Transport 

Protocol 

System Processing Interface (SPI)^!^!^! 
Message Passing Interface (^MPI)fe!!^^ 
Active Pebbles (AM-|--|-)!^ 

ARMlin!!!!! 

Eager, rendezvous!^, completion^! ^! 
Message reduction/caching!^!^!^ 
Message coalescing!^ !^!^ 

Message compression^! 

2D, 3d!^I^, ring!^, hypercube, rook!l^ 

Optimization 

Message routing 

Threading 

Multi-threaded!^, serialized, funneled 

Network Topology Sec. 3.2.3| 

Physical Topology 
Logical Topology 
Job topology 

3DE3I11] and 5 D!h! torus, Dragonfly!^ !^ 
Skewness!^, synthetic network!^ !^ 

Job allocation, rank mapping!^ 

Local Scheduling 

Sec. 3.2.4| 

Heavyweight 

Lightweight 

Pthreads!^, OpenMp!H!!n! 

Work stealing, FIFO tasks!^ 

Termination 

Quiescence detection!^, SKr!^ 

Hardware effects 

L2 atomics!^^, NUMA effects!^ 

Runtime Feedback Sec. 3.2.5| 

Optimal K-level^!, 

optimal A, sync to async switching!^ 


Table 2: Runtime-Level aspects of DGAs. 


are based on point-to-point communication, and they display 
similar communication performance characteristics (indeed, 
m implements rudimentary active messages using low-level 
system interfaces). However, AM++ provides a full scale of 
active message services, such as message routing, message 
reduction, and object-based addressing, and automatic exe¬ 
cution of handlers [29], which requires a runtime scheduler 
which may have an important impact on the performance of 
a DGA dSec. 2| ). 

3.2.2 Transport 

The transport layer is the part of the stack responsible for 
sending and receiving bits. Important properties of trans¬ 
port include how message buffers are handled, which entity 
manages them, and how frequently they need to be managed. 
The runtime needs to take several decisions regarding these. 

Request Tracking (RT). refers to how communication 
requests are made into transport: a request is scheduled and 
needs to be completed later [asynchronous)^ a request is 
made and the requester waits until all local data structures 
can be reused [locally synchronous)^ or a request is made and 


the requester waits until it has been completely processed 
[remotely syehronous). As an example, remotely synehronous 
request tracking in MPI (e.g., by MPI_Ssend etc.) guarantees 
a small number of messages on the network, but it hinders 
parallelization. On the other hand, using loeally synehronous 
may allow more parallelism if the underlying implementa¬ 
tion uses an eager protocol (e.g., MPI_Send). Finally, asyn¬ 
chronous RT uses interfaces such as MPI_Isend/MPI_IRecv 
to start requests along with MPLTestsome to check for their 
completion, maximizing overlap between computation and 
communication. However, with the asynchronous RT, the 
client of a transport must make decisions about how many 
requests to keep opened at any given time, how many to 
check, at any given time. Checconi and Petrini m used 
asynchronous RT, maintaining one buffer per destination. A 
send operation queues messages on these buffers until the 
buffer is full, and then hands it over to the network interface. 
But before starting to write to the buffer again, their im¬ 
plementation has to wait for the previous send to complete 
so the buffer becomes available. In [33], AM++ spawns a 
constant number of receive requests and as many sand re¬ 
quests as necessary, all stored and tracked in one array of 
MPI requests. 

Progression. Completing a round trip through transport 
requires a protocol for dedicating resources to the transport 
for bookkeeping, performing bit moving, and delivering the 
results of completed requests to the application—we call that 
protocol progression. Progression influences the timeliness 
and efficiency of transport delivery, and a wron g progres - 
sion model can render an application infeasible (Sec. 2.4). 
In asynchronous progression, computing resources are ded¬ 
icated to make progress. The resources can be dedicated 
through system or user threads. For example, Cray MPI 
provides an option for starting progression pthreads that 
perform internal MPI progress in parallel with the applica¬ 
tion threads. User threads serve similar purpose but are 
started by the user explicitly, and, for example, call MPI 
repeatedly to generate progress. In contrast, synchronous 
progression is done periodically in the runtime or application 
level. In explicit asynchronous progress, the application can 
choose explicitly, bypassing the runtime scheduler (if any), 
when to call progress, enabling optimizations at the cost of 
complexity. For example, we employed explicit polling for 
our DC, but we observed observed a decrease in performance. 
In a task-based system, network progress can be scheduled 
as a lightweight task. For example, AM++ implements net¬ 
work polling, buffer flushing, checking for termination, and 
executing pending handlers for received messages as tasks, 
on equal footing with application task that run message han¬ 
dlers. Finally, a runtime with a scheduler, such as AM++, 
can perform progress directly in the scheduler, which allows 
for more control and runtime feedback when performing pro¬ 
gression. Most papers do not discuss progression and request 
tracking explicitly, but the choices made for these parameters 
may have a profound effect on performance (cf.. Sec. 2.4 for 
a motivating example). 

We illustrate the progression and request tracking by briefly 
comparing our DC in AM++ with the work of Checconi and 
Petrini [12] . |Checconi and Petrini used a lightweight asyn¬ 
chronous communication layer on top of System Processing 
Interface (SPI), with separate FIFOs for injections and recep¬ 
tions (up to 16 FIFOs each per node, providing network-level 
parallelism). They queue messages to per destination buffers 


























where each buffer is exclusively owned and operated on by a 
single thread, eliminating locking and contention. Buffers are 
placed into injection queues when ready, and a thread will 
wait for completion when another message needs to be sent 
to a given destination. AM++ also maintains per-destination 
buffers. The buffers are shared and require atomic operations 
for writing. Compared to |Checconi and Petrini| AM++ does 
not wait for the send buffers to become available. Instead, 
it creates new buffers and can spawn multiple asynchronous 
send requests for the same destination. AM++ supports 
three different progression models. Polling can be invoked 
explicitly from the algorithm. Explicit polling bypasses the 
AM++ task queue, and directly queries the outstanding send 
and receive requests (with MPI_Testsome on an array of re¬ 
quests). AM-f+ also has special purpose user-level tasks that 
perform progression. These tasks are executed from AM-I--I- 
task queue when sends are performed, or when end-of-epoch 
tests are performed (during these tests AM++ tries to finish 
an epoch by processing remaining work). 

Bit transport. Bit transport is the lowest-level network 
interface used by upper levels to deliver bits from one loca¬ 
tion to another. In the work from the IBM group [ini[ii[i3], 
the System Processing Interface (SPI) communication layer 
serves as a bit transport (as described above). The majority 
of implementations we have discussed use Message passing 
Interface (MPI) for their bit transport. SPI is a direct inter¬ 
face to hardware queues, while MPI is a complex framework 
with extra functionality and semantics. Direct interfaces like 
SPI may yield more efficient communication, but are less or 
not at all portable, and may require more implementation 
effort to implement higher-level features. The third type of 
bit transport is based on remote method invocation (RMI) 
technique and is used in approaches based on STAPL nmn], 
a generic parallel library for graph and other data structures 
and algorithms. STAPL uses the ARMI (Adaptive Remote 
Method Invocation) active-message communication library, 
based on RMI. ARMI supports automatic message coalescing 
but does not provide routing or message reductions natively. 
Both AM-f-h and ARMI can use different backends, so from 
the application’s perspective they are bit transport them¬ 
selves, but internally they use a lower-level bit transport. 
Such layering makes it hard to optimize parameters related 
to bit transport (e.g., choosing proper coalescing size for 
AMH—h). 

Protocol. Bit transport may employ different protocols 
for different message sizes. For example, MPI point-to-point 
communication, may support eager protocol for small mes¬ 
sages and rendezvous protocol for larger transfers, sending 
messages without or with, respectively, round-trip commu¬ 
nication [3]. The autonomous and transparent choice of 
protocols may have a detrimental impact on applications 


(e.g.. Sec. 2.2| . 


Optimization. A number of runtime-level optimization 
techniques have been proposed in the literature to reduce com¬ 
munication overhead and maximize throughput. In [Sec. 27f| 
we discussed message reduction (caching) in AM-I--I-. Pearce 
et al. m used tree based broadcast, reduction and filter¬ 
ing for communication involving high degree vertices. This 
essentially forces the visitors to traverse the delegate (cf.. 


Sec. 3.1.2) tree and provides the opportunity to filter out 
messages. Panitanarak and Madduri [23] used local lookup 
arrays to track the tentative distance of every vertex, thus 
avoiding duplicate request being sent. 


Increasing message coalescing (cf., |Sec. 2.2| ) size increases 
the rate at which small messages can be sent over a network 
at the cost of latency. Checconi and Petrini m used co¬ 
alescing to pack together all the edges that would be sent 
to each destinat ion separately and queued them in an inter¬ 
mediate buffer. Pearce et al. EHES] combined coalescing 
with routing to reduce dense communication. Based on the 
observation that bisection bandwidth becomes a limiting 
factor for DGAs, Checconi and Petrini m utilized the avail¬ 
able processing power to compress their buffers of coalesced 
messages, using differential encoding scheme for compression 
of vertices. They reported that compression decreased the 
number of messages sent in intermediate BPS levels where 
high message traffic is present, but when the message traffic 
is low, compression degrades the performance. 

Message Routing. Pearce et al. [25] implemented rout¬ 
ing through a synthetic network to mimic the BG/P 3D torus 
interconnect topology. In a followup paper m, the authors 
additionally embedded the delegate tree as a means for fur¬ 
ther communication reduction. AM++ [29] also supports 
two types of software routing strategies: Rook routing and 
Hypercube routing. Rook routing reduces the number of com¬ 
municating buffers to 0{y/p) [15]. However, a disadvantage 
of software routing is that it increases message latency. Yoo 
et al. [32] used ring communication in their optimized collec¬ 
tive implementation and adjusted the diameter of the ring 
to achieve better performance. 

Threading. A message passing framework can support 
different level of thread safety. For MPI, there are 4 levels in 
total [2]. Single, Funneled, Serialized and Multiple. In our DC 
implementation [33], we used Multiple as the threading level 
with the aim to have the maximum flexibility for multiple 
threads to invoke MPI functions concurrently. 


3.2.3 Network Topology 

Gomputing resources are organized in several specific phys¬ 
ical topologies: 3D Torus, Dragon Fly, 5D Torus etc. The 
physical topology has a connection to collectives. For exam¬ 
ple, MPICH2 provides an all-to-all implementation that is 
optimized for Aries and Gemini systems. In m, the authors 
stated that the triangle of virtual processors are embedded 
in the Blue Gene/Q physical topology which is a 5D torus. 

Some of the graph traversal algorithms with 2D distri¬ 
bution make use of the pattern of communication between 
ranks and that pattern is called logical topology. In |Sec. 3.1.21 
we discussed Processor Grid Skewness. In relation to skew¬ 
ness, Buluc et al. [9] found out that the “tall skinny” grids 
performed faster and “short fat” grids performed worse than 
square grids. Pearce et al. [25] used a 3D routing topology 
to mirror the Blue Gene /P 3D torus interconnect topology. 
In this way they created a synthetic network to implement 
routing and aggregation. 

Job scheduler for computing resources allocate nodes based 
on scheduling policy and the input request. This formulates 
the job topology. Bhatele et al. [3 showed that performance 
of an application depends on job placement. Specially in 
Gray systems the variation can be significant. The variations 
can be due to the distances among allocated nodes or due to 
contention on shared network. 


3.2.4 Local Scheduling 

Depending on the node-level threading mechanism, thread 
scheduling policies and synchronization primitives, tasks 


















associated with a DGA can execute in different order with 
varying frequencies. For example, in an attempt to quickly 
spread good work, we can send a message with priority and 
put the message handler infront of the task queue. This is 
one way to achieve priority scheduling [33]. Data structures 
support (for example bitmaps in sync mode and global queue 
in async mode in m) is also an important factor to achieve 
effective local scheduling. Below we discuss several thread- 
granularity and scheduling related factors. 

Heavyweight threads (worker threads) can be used for intra¬ 
node threading. Bulug and Madduri [8] used MPI for inter 
node processing and GNU OpenMP for intra-node thread¬ 
ing. Bulug et al. [9] also used OpenMP threading in their 
implementation as it is beneficial for algorithms implemented 
using BLAS. But DC based unordered algorithms need more 
control over threading. Therefore, m used a combination 
of MPI and pthreads. 

Lightweight threads, implemented on top of kernel threads, 
can be scheduled differently. Lightweight thread scheduling 
mechanisms achieve load balancing mostly by work stealing 
and FIFO scheduler for tasks. 

The frequency of termination detection (TD) is also a very 
important catalyst for DGA performance. This is especially 
true for unordered algorithms. In AM++ [33], the termina¬ 
tion detection is implemented in the runtime level with the 
help of non-blocking collectives and work balancing FIFO 
queue. Hribar et al. [19] implemented TD in algorithm level 
and advocated that, when the computation time becomes 
smaller than the time it takes to receive a message, high 
detection frequency should be used. Otherwise low detection 
frequency should be used. 

Hardware Effects such as NUMA, L2 atomics impact on the 
in-node execution of distributed graph algorithms. Ghakar- 
avarthy et al. [TO] exploited atomic operations implemented 
in the L2 cache to achieve better aggregate update rate in 
relaxation. [81191 [21] also used atomic updates to mitigate 
synchronization costs. Also, Bulug et al. [9] considered the 
effect of multithreading within NUMA domains and observed 
noteworthy performance gain compared to Flat-MPI runs. 

3.2.5 Runtime Feedback 

Ghoosing algorithmic parameters adaptively, switching 
between different algorithms during execution and choosing 
the mode of algorithms (sync vs async) depend heavily on 
the feedback provided by the runtime. For example, in nil, 
to adaptively determine and set the level of asynchrony. A;, in 
each superstep, a set of conditions are being evaluated to take 
the decision about whether to double the size of k or not. One 
of the conditions checks whether the penalty for asynchrony 
(assessed in terms of wasted work) has exceeded a threshold 
or not. Based on the support of runtime, effect/propagation 
of wasted work can be mitigated by propagating better work 
from lower level and thus invalidating wasted work even 
before processing them in the algorithm level. If this scheme 
can be put into place in runtime, we can expect different 
adapted value of k compared to the one without the scheme. 
Similarly, Xie et al. m used throughput ie. the amount of 
vertices processed per unit time as a metric to switch between 
sync and async mode of algorithms. But the throughput 
depends on how efficiently the underlying runtime is being 
employed. 


4. CONCLUSIONS 

We demonstrate clearly and explicitly that the application- 
level parts that are reported as major contributions do not 
constitute a complete description of a DGA. We show how 
sometimes small changes in runtime can threaten the via¬ 
bility of an approach. We aim to raise awareness of the 
importance of runtime. With this in mind, we first provide 
a representative overview of application-level aspects in BFS 
and SSSP. Then we carefully analyze runtime aspects, which 
are usually overlooked. Based on our analysis, we provide 
the “anatomy of DGAs”. The anatomy consists of two major 
layers: the application-level aspects and the runtime-level 
aspects, which respectively represent the top and the bottom 
of the software/hardware stack. Each is further subdivided 
into categories, and we provide examples from existing re¬ 
search. We propose a set of guidelines for reporting research 
design and results. Altogether, the goal is to make research 
results in DGAs more accessible, general, and congruent. To 
achieve this goal, we believe that research has to be pre¬ 
sented in the context of the whole complex stack on modern 
supercomputers (getting more complex with progress toward 
exascale). We intend the guidelines to serve the community 
as an initial step to iterate and expand on. 

Our [Tables 1| and [^ serve as an initial map for reporting 
the design features. It is as important to state which parts 
of DGAs anatomy are explicitly covered in the results as 
which are not. Some may remain “buried in the stack”, their 
impact unknown (for example, the effects of job placement 
Sec. 3.2.^ are not usually investigated), and some may 


not be relevant in a given situation. Our anatomy helps both 
consumers and authors of research, the former to understand 
and the latter to present contributions. 

Next, it is important to outline the experimental pro¬ 
tocol used to obtain the results. This amounts to stating 
which parts of the parameter space are covered by experi¬ 
ments and, as importantly, which are not. Furthermore, it 
is helpful for the reader to understand why certain parts of 
the parameter space are covered and others are not, even 
if the reason is as mundane as limited resources (burning 
through time allocations is easy). Also, it is often helpful to 
present negative experimental results if any were observed as 
they help to uncover to which parameters a given approach 
is sensitive. 

Our analysis and guidelines are intended to be only the 
first, imperfect step in unifying the field. We posit that the 
DGA research community should collectively develop a set of 
standards expected from top notch research, acknowledging 
that DGAs exhibit particularly strong interaction with the 
software/hardware stack due to their irregularity. Thus we 
appeal to the wider community to take our initial suggestion 
and to help develop standards for more explicit incorporation 
of runtime interactions in future research results. 
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